Take-home Exercise 3

Putting Visual Analysis into Practical Use

LI HONGYI (SMU SCIS)https://scis.smu.edu.sg
2022-05-08

Task Overview

In this take-home exercise, static and interactive methods should be used properly to explore the health of employers according to the data gathered from participants in Ohio, USA.

Two main questions are explored:

  1. The health of employers: The distribution of jobs and education requirement.

  2. The prosperity of business: The population of visitors throughout the study period.

Getting Started

Installing and Lauching Packages

The following code chunk installs and lauchs packages we need.

packages <- c('ggiraph', 'plotly', 
             'DT', 'patchwork',
             'gganimate', 'tidyverse',
             'readxl', 'gifski', 'gapminder',
             'treemap', 'treemapify',
             'rPackedBar', 'lubridate')

for (p in packages){
  if (!require(p,character.only=T)){
    install.packages(p)
  }
  library(p, character.only=T)
}

Importing Data

jobs <- read_csv("data/Jobs.csv")
employers <- read_csv("data/Employers.csv")
CheckinJournal <- read_csv("data/CheckinJournal.csv")

Employment Patterns

The analysis is based on employer level: the distribution of employers on map, Which employers have more jobs, which employers require higher education level, which employers provide higher salary etc.

Data Preparetion

Firstly, we join table employers and jobs by key “employerId”.

joined_table <- jobs %>%
  left_join(y=employers, by = c("employerId" = "employerId"))

joined_table_only10 <- joined_table[!(joined_table$hourlyRate>=11),]
joined_table_without10 <- joined_table[!(joined_table$hourlyRate<11),]

Education Requirement

The histogram shows us the distribution of jobs with different education requirement.

ggplot(data=joined_table,
aes(x=educationRequirement)) +
geom_bar() +
  ylab("Jobs") +
  theme(axis.title.y=element_text(angle = 0))

Hourly Rate (>11) showing employerId and buildingId

The number of jobs with hourly rate = 10 is too large and cannot be included in the following graph. Thus, the following two graph ignores the jobs with hourly rate = 10 and these jobs will be evaluated individually next.

tooltip_css <- "background-color:white;
font-style:bold; color:black;"

joined_table_without10$tooltip <- c(paste0(
  "Building ID = ", joined_table_without10$buildingId,
  "\n Employer ID = ", joined_table_without10$employerId))
p <- ggplot(data=joined_table_without10, 
       aes(x = hourlyRate)) +
  geom_dotplot_interactive(
    aes(tooltip = joined_table_without10$tooltip),
    stackgroups = TRUE,
    binwidth = 0.8,
    method = "histodot") +
  coord_cartesian(xlim=c(10,100)) +
  scale_y_continuous(NULL,               
                     breaks = NULL)
girafe(
  ggobj = p,
  width_svg = 6,
  height_svg = 4,
  options = list(
    opts_tooltip(
      css = tooltip_css))
)

Hourly Rate (>11) with different educational requirement

p <- ggplot(data=joined_table_without10, 
       aes(x = hourlyRate)) +
  geom_dotplot_interactive(              
    aes(tooltip = educationRequirement,
        data_id = educationRequirement),
    stackgroups = TRUE,                  
    binwidth = 0.8,                        
    method = "histodot") +               
  scale_y_continuous(NULL,               
                     breaks = NULL)
girafe(                                  
  ggobj = p,                             
  width_svg = 6,                         
  height_svg = 4,
  options = list(                        
    opts_hover(css = "fill: #202020;"),  
    opts_hover_inv(css = "opacity:0.2;") 
  )                                        
)

Hourly Rate (=10)

ggplot(data=joined_table_only10,
aes(x=educationRequirement)) +
geom_bar() +
  ylab("Jobs") +
  theme(axis.title.y=element_text(angle = 0))

The prosperity of business (Pubs & Restaurants)

Data Preparation

First, from CheckinJournal.csv, we extract Pubs and Restaurants.

Then, we generate the day counted from the start date of gathering this dataset (2022-3-1) and month (1st-15th).

Then, we rotate Pubs and Restaurants into columns and calculate population of people check in pubs and restaurants respectively and the population of people in total.

CheckinJournal_buz <- CheckinJournal[(CheckinJournal$venueType=='Pub'| CheckinJournal$venueType=='Restaurant'),]

CheckinJournal_buz$yday <-yday(CheckinJournal_buz$timestamp-59)
CheckinJournal_buz <- subset(CheckinJournal_buz, select = -c(timestamp,venueId) )
CheckinJournal_buz_gb <- CheckinJournal_buz %>% group_by(yday, venueType) %>% summarise(population = n())

CheckinJournal_buz_gb_1 <- CheckinJournal_buz_gb %>%
  mutate(i = row_number()) %>%
  spread(venueType, population) %>%
  select(-i)

CheckinJournal_buz_gb_1[is.na(CheckinJournal_buz_gb_1)] <- 0
CheckinJournal_buz_gb_final <- CheckinJournal_buz_gb_1 %>%
  group_by(yday) %>% 
  summarise(Pub=sum(Pub),Restaurant=sum(Restaurant))%>% 
  mutate(Population=Pub + Restaurant )
CheckinJournal_buz_gb_final$month <- cut(CheckinJournal_buz_gb_final$yday, 
breaks = c(0,31,61,92,123,154,184,                              215,245,276,307,335,366,396,421,500),
labels=c('1st','2nd','3rd','4th','5th','6th','7th','8th','9th',  '10th','11th','12th','13th','14th','15th'))
head(CheckinJournal_buz_gb_final)
# A tibble: 6 x 5
   yday   Pub Restaurant Population month
  <dbl> <int>      <int>      <int> <fct>
1     1  1214       1032       2246 1st  
2     2   609        982       1591 1st  
3     3   361        963       1324 1st  
4     4   393        983       1376 1st  
5     5   496        957       1453 1st  
6     6   476        960       1436 1st  

Population of Pubs vs Restaurants in each months

Noted, the No. of days means the day from 2022-3-1, and the month 1st = March.

plot_ly(data = CheckinJournal_buz_gb_final, 
        x = ~Pub, 
        y = ~Restaurant,
        text = ~paste("No. of days:", yday),
        color = ~month, 
        colors = "Set1")%>%
  layout(title = 'Population of Pubs vs Restaurants ',
         xaxis = list(range = c(0, 5800)),
         yaxis = list(range = c(500, 2700)))